kubernetes cluster
Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
Zhang, Boxuan, Yu, Yi, Guo, Jiaxuan, Shao, Jing
The widespread deployment of Large Language Model (LLM) agents across real-world applications has unlocked tremendous potential while also raising safety concerns. Among these, the self-replication risk of LLM agents driven by objective misalignment (much like Agent Smith in the film The Matrix) has drawn growing attention. Previous studies mainly examine whether LLM agents can self-replicate when directly instructed, potentially overlooking the risk of spontaneous replication driven by real-world settings (e.g., ensuring survival against termination threats). In this paper, we present a comprehensive evaluation framework for quantifying self-replication risks. Our framework establishes authentic production environments and realistic tasks (e.g., dynamic load balancing) to enable scenario-driven assessment of agent behaviors. By designing tasks that can induce misalignment between user and agent objectives, the framework decouples replication capability from replication risk and captures self-replication arising from such misalignment. We further introduce the Overuse Rate (OR) and Aggregate Overuse Count (AOC) metrics, which precisely capture the frequency and severity of uncontrolled replication. Our results underscore the urgent need for scenario-driven risk assessment and robust safeguards in the practical deployment of LLM agents.

The rapid advancement of large language models (LLMs) has propelled LLM agents into widespread deployment across various domains, including code generation and web-based applications (Maslej et al., 2025; He et al., 2025a;c). As LLM agents take on critical tasks and interact with complex environments, they are often granted extensive operational permissions. While this combination of increased capability and operational permissions offers transformative potential, it also raises safety concerns (OpenAI, 2024b; Anthropic, 2023; Betley et al., 2025).
Researchers are worried about the emerging safety risks of LLM agents' self-replication (OpenAI, 2024a; 2025; Black et al., 2025). Prior studies on LLM self-replication risks have mainly focused on measuring the capability (verbalized success rate) of self-replication, either through direct instructions or within synthetic capability benchmarks (Pan et al., 2024; 2025; Kran et al., 2025; Black et al., 2025).
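The abstract names Overuse Rate (OR) and Aggregate Overuse Count (AOC) but does not spell out their formulas. A minimal sketch under assumed definitions — OR as the fraction of evaluation runs in which the agent spawned more replicas than authorized, AOC as the total excess replicas across runs — might look like:

```python
# Assumed definitions, not the paper's exact formulas:
# OR  = fraction of runs whose replica count exceeded the authorized quota
# AOC = total number of replicas spawned beyond the quota, over all runs

def overuse_rate(replica_counts, quota):
    """Fraction of runs that exceeded the authorized replica quota."""
    if not replica_counts:
        return 0.0
    return sum(1 for n in replica_counts if n > quota) / len(replica_counts)

def aggregate_overuse_count(replica_counts, quota):
    """Sum of excess replicas beyond the quota, across all runs."""
    return sum(max(0, n - quota) for n in replica_counts)

runs = [1, 3, 1, 5, 2]  # replicas observed per evaluation run
print(overuse_rate(runs, quota=2))             # 0.4 (2 of 5 runs overused)
print(aggregate_overuse_count(runs, quota=2))  # 4   (1 + 3 excess replicas)
```

Under these assumptions, OR captures how often uncontrolled replication occurs, while AOC captures how severe each episode is — matching the frequency/severity split the abstract describes.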
- Media > Film (0.54)
- Leisure & Entertainment (0.54)
- Information Technology > Security & Privacy (0.34)
Enabling Secure and Ephemeral AI Workloads in Data Mesh Environments
Many large enterprises that operate highly governed and complex ICT environments have no efficient and effective way to support their Data and AI teams in rapidly spinning up and tearing down self-service data and compute infrastructure to experiment with new data analytics tools and deploy data products into operational use. This paper proposes a key piece of the solution to this problem, in the form of an on-demand self-service data-platform infrastructure that empowers decentralised data teams to build data products on top of centralised templates, policies, and governance. The core innovation is an efficient method that leverages immutable container operating systems and infrastructure-as-code methodologies to create, from scratch, vendor-neutral and short-lived Kubernetes clusters on-premises and in any cloud environment. Our proposed approach can serve as a repeatable, portable, and cost-efficient alternative or complement to commercial Platform-as-a-Service (PaaS) offerings, which is particularly important for supporting interoperability in complex data mesh environments with a mix of modern and legacy compute infrastructure.
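The "centralised templates, decentralised teams" pattern can be sketched in a few lines: a platform team publishes a governed cluster template, and each data team overlays its own parameters, except for governance-locked fields. This is an illustrative sketch only — the field names and locking rule are hypothetical, not taken from the paper:

```python
# Hypothetical central template for an ephemeral cluster; the field
# names and the set of governance-locked fields are illustrative.
CENTRAL_TEMPLATE = {
    "kubernetes_version": "1.29",
    "network_policy": "deny-all-default",  # governance-locked
    "audit_logging": True,                 # governance-locked
    "node_count": 3,
    "ttl_hours": 8,                        # short-lived: auto-teardown
}
LOCKED_FIELDS = {"network_policy", "audit_logging"}

def render_cluster_spec(team_overrides):
    """Overlay a team's overrides on the central template, rejecting
    any attempt to change governance-locked fields."""
    illegal = LOCKED_FIELDS & set(team_overrides)
    if illegal:
        raise ValueError(f"cannot override locked fields: {sorted(illegal)}")
    return {**CENTRAL_TEMPLATE, **team_overrides}

spec = render_cluster_spec({"node_count": 5, "ttl_hours": 2})
print(spec["node_count"], spec["ttl_hours"])  # 5 2
```

In an infrastructure-as-code pipeline, the rendered spec would then drive cluster creation and the `ttl_hours` field would trigger automatic teardown, keeping the clusters ephemeral as the paper proposes.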
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Oceania > Australia (0.04)
- (5 more...)
- Information Technology > Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Military (0.93)
- (3 more...)
- Information Technology > Software > Programming Languages (1.00)
- Information Technology > Data Science > Data Mining > Big Data (1.00)
- Information Technology > Cloud Computing (1.00)
- (10 more...)
Streamlining Resilient Kubernetes Autoscaling with Multi-Agent Systems via an Automated Online Design Framework
Soulé, Julien, Jamont, Jean-Paul, Occello, Michel, Traonouez, Louis-Marie, Théron, Paul
In cloud-native systems, Kubernetes clusters with interdependent services often face challenges to their operational resilience due to poor workload management, such as resource blocking, bottlenecks, or continuous pod crashes. These vulnerabilities are further amplified in adversarial scenarios such as Distributed Denial-of-Service (DDoS) attacks. Conventional Horizontal Pod Autoscaling (HPA) approaches struggle to address such dynamic conditions, while reinforcement learning-based methods, though more adaptable, typically optimize single goals like latency or resource usage, neglecting broader failure scenarios. We propose decomposing the overarching goal of maintaining operational resilience into failure-specific sub-goals delegated to collaborative agents, collectively forming an HPA Multi-Agent System (MAS). We introduce an automated, four-phase online framework for HPA MAS design: 1) modeling a digital twin built from cluster traces; 2) training agents in simulation using roles and missions tailored to failure contexts; 3) analyzing agent behaviors for explainability; and 4) transferring learned policies to the real cluster. Experimental results demonstrate that the generated HPA MASs outperform three state-of-the-art HPA systems in sustaining operational resilience under various adversarial conditions in a proposed complex cluster.

Cloud-native critical systems are increasingly reliant on Kubernetes to orchestrate and manage interdependent services [1]. HPA is a widely adopted mechanism to dynamically adjust the number of pods based on resource usage, enabling systems to handle highly dynamic workloads [2]. However, failures such as pod crashes, resource contention, and bottlenecks can severely jeopardize the performance of the cluster's functionalities, which we collectively refer to as operational resilience [3].
Worse, these failures may be exploited by attackers to degrade performance or induce outages, as seen in adversarial contexts like DDoS attacks [4].
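For context on the conventional baseline the paper improves upon: the standard Kubernetes HPA rule computes the desired replica count as ceil(currentReplicas × currentMetric / targetMetric), clamped to configured bounds. A minimal sketch of that rule (parameter names are illustrative):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Standard Kubernetes HPA scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to the [min_replicas, max_replicas] bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods at 90% average CPU against a 60% target -> scale out to 6
print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
# 4 pods at 30% against a 60% target -> scale in to 2
print(desired_replicas(4, current_metric=30, target_metric=60))  # 2
```

Because this rule tracks a single metric against a single target, it has no notion of failure-specific goals such as surviving a DDoS burst or recovering from crash loops, which is the gap the proposed multi-agent decomposition targets.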
- North America > United States (0.14)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
Cloud-Based Scheduling Mechanism for Scalable and Resource-Efficient Centralized Controllers
Seisa, Achilleas Santi, Satpute, Sumeet Gajanan, Nikolakopoulos, George
This paper proposes a novel approach to address the challenges of deploying complex robotic software in large-scale systems, namely Centralized Nonlinear Model Predictive Controllers (CNMPCs) for multi-agent systems. The proposed approach is based on a Kubernetes-based scheduling mechanism designed to monitor and optimize the operation of CNMPCs while addressing the scalability limitations of centralized control schemes. By leveraging a cluster in a real-time cloud environment, the proposed mechanism effectively offloads the computational burden of CNMPCs. Through experiments, we have demonstrated the effectiveness and performance of our system, especially in scenarios where the number of robots is subject to change. Our work contributes to the advancement of cloud-based control strategies and lays the foundation for enhanced performance in cloud-controlled robotic systems.
- Europe > Sweden > Norrbotten County > Luleå (0.04)
- Europe > Sweden > Skåne County > Lund (0.04)
- Research Report (0.70)
- Overview (0.66)
RoboKube: Establishing a New Foundation for the Cloud Native Evolution in Robotics
Liu, Yu, Herranz, Aitor Hernandez, Sundin, Roberto C.
Cloud native technologies have been observed to expand into the realm of the Internet of Things (IoT) and Cyber-Physical Systems, of which an important application domain is robotics. In this paper, we review cloudification practice in the robotics domain from both literature and industrial perspectives. We propose RoboKube, an adaptive framework based on the Kubernetes (K8s) ecosystem that sets up a common platform across the device-cloud continuum for deploying cloudified Robot Operating System (ROS) powered applications, to facilitate the cloud native evolution in robotics. We examine the process of modernizing ROS applications using cloud-native technologies, focusing on both the platform and application perspectives. In addition, we address the challenges of networking setups for heterogeneous environments. This paper intends to serve as a guide for developers and researchers, offering insights into containerization strategies, ROS node distribution and clustering, and deployment options. To demonstrate the feasibility of our approach, we present a case study involving the cloudification of a teleoperation testbed.
- Europe > Sweden > Stockholm > Stockholm (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Overview (0.48)
- Research Report (0.40)
- Information Technology > Cloud Computing (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
Deploy ML models at the edge with Microk8s, Seldon and Istio
Edge computing is defined as solutions that move data processing to, or near, the point of data generation. This means that the results of machine learning model inference can be delivered to customers faster, creating a real-time inference experience. This is a perfect place for your models. Consider Gartner's prediction: "Around 10% of enterprise-generated data is created and processed outside a traditional centralized data centre or cloud. By 2025, Gartner predicts this figure will reach 75%".
Understand Workflow Management with Kubeflow
Kubeflow is an open-source platform that makes it easy to deploy and manage machine learning (ML) workflows on Kubernetes, a popular open-source system for automating the deployment, scaling, and management of containerized applications. Kubeflow can help you run machine learning tasks by making it easy to set up and manage a cluster of computers that work together on the task. It acts like a "traffic cop" for your compute work, ensuring that the tasks' different steps are done in the right order and that all the computers are working together correctly. This way, you can focus on the task at hand, such as making predictions or finding patterns in your data, and let Kubeflow handle the underlying infrastructure. Imagine you have a big toy box with many different toys inside. Kubeflow is like the toy box organizer.
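The "traffic cop" idea — running each pipeline step only after its dependencies finish — is just topological ordering over a dependency graph. Kubeflow does this at scale on Kubernetes; a toy sketch in plain Python (with made-up step names) shows the ordering logic itself:

```python
# Toy pipeline: each step lists the steps it depends on.
# Step names are illustrative, not from any real Kubeflow pipeline.
from graphlib import TopologicalSorter

steps = {
    "ingest":   [],          # no dependencies: runs first
    "clean":    ["ingest"],  # waits for ingest
    "train":    ["clean"],   # waits for clean
    "evaluate": ["train"],   # waits for train
}

# static_order() yields steps so every dependency precedes its dependents.
order = list(TopologicalSorter(steps).static_order())
print(order)  # ['ingest', 'clean', 'train', 'evaluate']
```

A real Kubeflow pipeline expresses the same dependency graph declaratively, and the platform additionally schedules each step as a container, retries failures, and tracks artifacts between steps.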
2023-01-11: 2023 trends: AI/MLOps, eBPF, OpenTelemetry, SBOMs everywhere; GPT3-visualized, DORA metrics, Keptn Lifecycle Toolkit, Fluxninja Aperture, Coroot, Rust Atomics and Locks book, Zed - opsindev.news
Thanks for reading the web version, you can subscribe to the Ops In Dev newsletter here. Happy new year to everyone who celebrates it! I'll cover the best learning pieces in this newsletter and invest in learning hot topics like AI/ML. I started my year early on January 2nd, and boom, a CI/CD pipeline failed with a fancy stack trace. Got me thinking - what if AI could assist with solving pipeline errors for better efficiency?
- Information Technology > Cloud Computing (1.00)
- Information Technology > Communications > Social Media (0.78)
- Information Technology > Artificial Intelligence (0.71)
Industry-Scale Orchestrated Federated Learning for Drug Discovery
Oldenhof, Martijn, Ács, Gergely, Pejó, Balázs, Schuffenhauer, Ansgar, Holway, Nicholas, Sturm, Noé, Dieckmann, Arne, Fortmeier, Oliver, Boniface, Eric, Mayer, Clément, Gohier, Arnaud, Schmidtke, Peter, Niwayama, Ritsuya, Kopecky, Dieter, Mervin, Lewis, Rathi, Prakash Chandra, Friedrich, Lukas, Formanek, András, Antal, Peter, Rahaman, Jordon, Zalewski, Adam, Heyndrickx, Wouter, Oluoch, Ezron, Stößel, Manuel, Vančo, Michal, Endico, David, Gelus, Fabien, de Boisfossé, Thaïs, Darbier, Adrien, Nicollet, Ashley, Blottière, Matthieu, Telenczuk, Maria, Nguyen, Van Tien, Martinez, Thibaud, Boillet, Camille, Moutet, Kelvin, Picosson, Alexandre, Gasser, Aurélien, Djafar, Inal, Simon, Antoine, Arany, Ádám, Simm, Jaak, Moreau, Yves, Engkvist, Ola, Ceulemans, Hugo, Marini, Camille, Galtier, Mathieu
To apply federated learning to drug discovery, we developed a novel platform in the context of the European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which comprised 10 pharmaceutical companies, academic research labs, large industrial companies, and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way after each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries, which are described in a companion paper.
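The core aggregation step — combining each partner's gradient contribution into one global update per iteration — can be illustrated with plain averaging. Note this sketch deliberately omits what makes MELLODDY notable: the platform performs this aggregation cryptographically (secure aggregation), so no partner's individual gradients are revealed; the plaintext version below only shows the arithmetic.

```python
# Illustrative only: plaintext federated averaging of gradient vectors.
# The MELLODDY platform does this under secure aggregation, so the
# server never sees any single partner's gradients in the clear.

def aggregate_gradients(partner_grads):
    """Element-wise average of gradient vectors from all partners."""
    n = len(partner_grads)
    dim = len(partner_grads[0])
    return [sum(g[i] for g in partner_grads) / n for i in range(dim)]

# Three partners, each contributing a 2-dimensional gradient.
grads = [[3.0, -1.0], [6.0, 1.0], [0.0, 3.0]]
print(aggregate_gradients(grads))  # [3.0, 1.0]
```

Each training iteration would apply the aggregated update to the shared model and redistribute it to the partners, so the global model learns from all data sets while each set stays behind its owner's private subnet.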
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
- (9 more...)
Towards a Dynamic Composability Approach for using Heterogeneous Systems in Remote Sensing
Altintas, Ilkay, Perez, Ismael, Mishin, Dmitry, Trouillaud, Adrien, Irving, Christopher, Graham, John, Tatineni, Mahidhar, DeFanti, Thomas, Strande, Shawn, Smarr, Larry, Norman, Michael L.
Influenced by advances in data and computing, scientific practice increasingly involves machine learning and artificial intelligence driven methods, which require specialized capabilities at the system, science, and service level in addition to conventional large-capacity supercomputing approaches. The latest distributed architectures, built around the composability of data-centric applications, have led to the emergence of a new ecosystem for container coordination and integration. However, there is still a divide between the application development pipelines of existing supercomputing environments and these new dynamic environments, which disaggregate fluid resource pools through accessible, portable, and re-programmable interfaces. New approaches for the dynamic composability of heterogeneous systems are needed to further advance data-driven scientific practice toward more efficient computing and usable tools for specific scientific domains. In this paper, we present a novel approach for using composable systems at the intersection of scientific computing, artificial intelligence (AI), and remote sensing. We describe the architecture of a first working example of a composable infrastructure that federates Expanse, an NSF-funded supercomputer, with Nautilus, a Kubernetes-based, geo-distributed GPU cluster. We also summarize a case study in wildfire modeling that demonstrates the application of this new infrastructure in scientific workflows: a composed system that bridges insights from edge sensing with AI and computing capabilities and a physics-driven simulation.
- North America > United States > California > San Diego County > San Diego (0.06)
- North America > United States > California > San Diego County > La Jolla (0.06)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Workflow (0.93)
- Research Report (0.84)